Lab 1 - Used Cars in the USA

By: David Wei, Sophia Wu, Dhruba Dey, Queena Wang

Business Understanding:

As a team, we decided to work on a dataset that can be understood from common life experience, without requiring specialized domain knowledge such as finance or marketing. Secondly, we wanted to experiment with a kind of dataset we had not previously been exposed to, such as real estate or life expectancy. Of course, we had to meet the project guidelines of 30,000 rows and 10 features, which prompted us to look for a fairly large dataset with many features, so that we could effectively cull the right set of features with numerical, boolean and categorical values. In addition, working with a large dataset would give us experience we had not yet acquired. This thought process led us to choose the “Used Car Dataset”, which has 3 million rows and 66 features.

We wanted to build 3 different types of models on this dataset: first, regression; second, binary classification using logistic regression; and third, multi-class classification using either K-Nearest Neighbors or Random Forest, depending on the best fit. We picked “price” as the response feature for the regression model because it is numerical and is pivotal to a used-car purchase. For logistic regression, we opted for “has_accidents” as the response variable because it is boolean, and it is an interesting question whether we can effectively classify used cars that have had accidents. Finally, we used “body_type” as the target variable for multi-class classification; it has 9 levels, which we believed made it a good candidate.

We first applied our domain knowledge, our collective experience of buying used cars, to mine the features we thought would be relevant to a used-car purchase decision. That reduced the number of features from 66 to 40. We then took a close look at each feature and tried to eliminate as many as possible, to keep the data size manageable so that the data wrangling and model building to be performed in the next phase could be handled with the computing resources available to us. We also provide the justification for why each feature was removed from the list of 40. However, we were cautious about categorical variables, because each one spawns multiple variables after encoding; that is the key reason we restricted the number of categorical features. The final dataset has 19 variables: 15 numerical, 1 boolean and 3 categorical. By removing the missing values after subsetting the dataset to the intended features, we reduced the number of rows from 3M to close to 700K. We will also conduct PCA and regularization for dimensionality reduction. Based on our analysis and understanding of the data, we believe we selected the right data for the purpose of this project.

The general approach for the project is to build each model on the training data and cross-validate it on the test data to gauge its effectiveness. We will use cross-validation techniques such as k-fold as appropriate. As noted earlier, we will use regression for “price” prediction, and the effectiveness measures used to validate the regression model are RMSE and CV PRESS. For logistic regression, the area under the ROC curve (AUC) is an aggregate metric that evaluates how well the model separates positive and negative outcomes across all possible cutoffs; our objective is to maximize the AUC and then select the right cutoff for classification. Of course, we will also present the confusion matrix to report the misclassification rate, recall and precision. Similarly, for multi-class classification we will use accuracy, recall and precision to evaluate the effectiveness of the model. We will also keep an eye on sensitivity and specificity, since we do not want to build a model that is overly sensitive or overly specific.
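The k-fold cross-validated RMSE described above can be sketched as follows. This is a minimal illustration on synthetic data, not the report's actual pipeline; the feature names and values are hypothetical stand-ins for the used-car predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for scaled used-car features and price (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                     # e.g. horsepower, mileage, year
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=500)

# 5-fold cross-validated RMSE for a plain linear regression.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)
print(rmse.mean())
```

The same `cross_val_score` call with `scoring="roc_auc"` would produce the cross-validated AUC for the logistic regression model.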


Data Meaning Type

Domain based Attribute Reduction

Before we do a deep dive into the types of our data, we will first look into reducing the full set of 66 attributes. After a quick observation of the column headers, we can deduce that not all columns will be necessary for our analysis. The reasons for removing them are below:

subsetting columns by referencing the column indexes
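Position-based column subsetting can be sketched with `DataFrame.iloc` as below; this is a toy illustration with made-up columns and indexes, not the notebook's actual index list.

```python
import pandas as pd

# Tiny stand-in frame; the real dataset has 66 columns (values here are hypothetical).
df = pd.DataFrame({
    "vin": ["1A", "2B"], "price": [21700, 15000],
    "horsepower": [180, 140], "interior_color": ["black", "tan"],
})

# Keep only the columns at the listed positions (here: vin and price).
keep_idx = [0, 1]
df_subset = df.iloc[:, keep_idx]
print(list(df_subset.columns))
```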

Data Quality

After quick profiling of some of our identified columns, we can see that engine_cylinders and engine_type contain the same values. Additionally, we found that the prefixes and suffixes attached to some values are descriptive text, so those values are not continuous.

removing city_fuel_economy since a quick visualization of our dataset shows that ALL values are empty

We also found that 'engine_cylinders' and 'engine_type' contain the same data in two separate columns, so we first test whether this is true and, if it is, remove one.

observing if 'engine_cylinders','engine_type' is the same data
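The redundancy check described above can be sketched with `Series.equals`; the sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the two engine columns.
df = pd.DataFrame({
    "engine_cylinders": ["I4", "V6", "V8"],
    "engine_type":      ["I4", "V6", "V8"],
})

# If the columns match element-for-element, one of them is redundant.
if df["engine_cylinders"].equals(df["engine_type"]):
    df = df.drop(columns=["engine_type"])
print(df.columns.tolist())
```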

We also found additional columns that are simply descriptions of other columns. For example, the 'wheel_system_display' attribute is just a longer, more descriptive version of the 'wheel_system' attribute ("Front-Wheel Drive" vs. "FWD"). For columns that follow this pattern, we will remove the descriptive column from our dataset.

We also discussed the 'interior_color' attribute, which has 45,726 distinct color values. Due to that sheer volume and complexity (the number of levels), we decided to remove it from our dataset as well: it is unrealistic for the type of modeling we are doing and would hurt the overall performance of our model.

note this plot takes forever to run

(ggplot(df_cln_1) + aes(x="interior_color", y="price") + geom_line())

Another attribute we found could be removed is "listed_date". Since there are no other date-type attributes in our dataset, we did not see a point in keeping it, as there is no other date to reference it against. Additionally, the 'daysonmarket' attribute already pre-aggregates the number of days it took to sell a vehicle, which is what the listed date would otherwise have been used for. Because of this, we will remove it from our dataframe as well.

At this point, we decided to take a break from subsetting our data and look at the amount of missing values in the current dataframe. Our intention was to see what the data looked like after cleaning, to further guide reducing the number of attributes.

Data Cleaning - Duplicates, Missing Data, Nulls

Now that our datatypes have been adjusted, we will work on cleaning up any empty data in our dataset. To begin, we will check whether any VINs are duplicated in the dataset. Since a VIN is unique to a car, we expect there should not be any, but in case there are, we will remove them.

We can see that there are duplicates even though the 'vin' should be specific and distinct to each car: a total of 80 records are involved in duplicates. We will remove these duplicates while keeping the 'first' record, so that one copy of each remains. After cleaning these duplicates, our total number of records drops from 3,000,040 to 3,000,000.
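The keep-the-first de-duplication step can be sketched with `drop_duplicates`; the VINs and prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one duplicated VIN.
df = pd.DataFrame({"vin":   ["A1", "B2", "B2", "C3"],
                   "price": [10000, 12000, 12000, 9000]})

n_dupes = df.duplicated(subset="vin").sum()        # count the extra copies
df = df.drop_duplicates(subset="vin", keep="first")  # keep the first occurrence
print(n_dupes, len(df))
```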

After cleaning the duplicates, a quick visualization of our data shows that almost every column has empty values. We will focus on analyzing those that have a large amount of empty data (e.g. frame_damaged, has_accidents, isCab, etc.).

visualization of our data BEFORE removing all rows with missing data

remove the missing values

checking the row counts of columns to see the missing rows

visualization of our data AFTER removing all rows with missing data

We decided that, since our original dataset was large (10 GB with 3 million records), instead of imputing data based on the mean or other statistics, we would delete all records with empty values. This still leaves about 700,000 records, which not only provides plenty of data but also keeps the data raw rather than imputed.
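The drop-rather-than-impute decision above amounts to a single `dropna` call; sketched here on a tiny hypothetical frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with scattered missing values.
df = pd.DataFrame({"price":   [21700, np.nan, 15000],
                   "mileage": [30000, 45000, np.nan]})

before = len(df)
df_clean = df.dropna()   # drop every row with at least one missing value
print(before, len(df_clean))
```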

Data Cleaning - Datatypes

At this point we need to convert a few of our data columns to the appropriate data type by removing the parts of the value string that we do not need, such as "gal" in fuel_tank_volume.

Finding all unique values per column to see what values we need to clean


Regarding fuel_tank_volume and maximum_seating, we can see a pattern in the suffixes, "gal" and "seats" respectively. We will now remove them and then convert all values to numeric.

removing unnecessary string values in the columns, then replacing any values that contain '--' with NaN, and lastly converting each value first to a string type and then to a float type
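That suffix-stripping and numeric conversion can be sketched as below; the raw strings are hypothetical but mimic the "gal" / "seats" / "--" pattern described above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values mimicking the suffix pattern in the dataset.
df = pd.DataFrame({"fuel_tank_volume": ["14.5 gal", "--", "18.0 gal"],
                   "maximum_seating":  ["5 seats", "7 seats", "--"]})

for col, suffix in [("fuel_tank_volume", " gal"), ("maximum_seating", " seats")]:
    df[col] = (df[col]
               .str.replace(suffix, "", regex=False)  # strip the unit suffix
               .replace("--", np.nan)                 # '--' means missing
               .astype(float))                        # numeric at last
print(df.dtypes.tolist())
```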

doing a quick profile on the subsetted columns

We can now see that all of our column values have been adjusted to the correct datatypes. We will next proceed with cleaning up the remainder of our data attributes. A quick count of our datatypes shows that we still have 19 categorical values, 14 numerical types and 2 booleans (true/false) left to work with. Since our ultimate goal is regression and classification models, we will next work on trimming down our 'object' categorical data types.

count of datatypes in current dataframe

After the team chose 'body_type' as the main classification target, we created a final dataframe with only the attributes we found useful for our models, which excludes:

creating final df for analysis

Simple Statistics

Looking at the final data, we have 18 columns in total; their data types are bool (1), float64 (12), int64 (2) and object (3).

As our main goal is to predict car price and car type, in the simple statistics we need to find the 5 most significant attributes that affect the car price, which we divided into the steps below:

1. Check the data range, mode, mean, median, variance, counts, etc.
2. Check data normality.
3. Check correlations.
4. From the correlation check, the 5 most significant attributes are horsepower (0.629), mileage (-0.43), highway_fuel_economy (-0.401), year (0.371) and engine_displacement (0.427).

An interesting thing we found is that seller_rating and owner_count do not show a strong relationship to car price in the correlation check; they are only -0.219 and -0.035.

Median, Standard Deviation, Mean in a Pandas Dataframe

Observing the simple statistics, we can see that the median price aligns with the IQR, with most cars priced around $21,700. What is interesting to note is that, though the median and the mean are fairly similar, the standard deviation is about $15,878. This is because the extremes in car pricing vary greatly from the average, with the lowest-priced car at $484 and the highest-priced car at $3,299,995.
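The median / mean / standard-deviation comparison above comes straight from pandas summary methods; a minimal sketch on a hypothetical handful of prices (the real column has ~700K values):

```python
import pandas as pd

# Hypothetical price sample illustrating the skew: one very cheap and one
# very expensive car pull the mean and std away from the median.
price = pd.Series([484, 15000, 21700, 28000, 3299995])

print(price.median(), round(price.mean(), 1), round(price.std(), 1))
```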

Inter-quartile Range

We can also see that our dataset has the majority (~99%) of our cars being used cars as described in the title, with less than 1% being new.

Check the count of new cars

We then looked into the number of cars of different types that have accidents or not. We found that 84,706 cars have accidents, and of those, SUV/Crossover and Sedan cars have higher numbers of accidents, as their overall counts are larger too. Perhaps we could say people love SUV and Sedan cars more than others.

check the count of cars that have accidents

Visualize the number of cars of different types that have accidents or not.

Visualize the number of frame damaged cars

check the count of frame_damaged cars

Data normality check

select all numerical data types from the final data

check the normality of all numerical data with a pairplot

From the histograms, we can see that city_fuel_economy, engine_displacement and highway_fuel_economy are more normally distributed than the other columns. As our sample size is large, we invoke the Central Limit Theorem and assume they are approximately normal.

Correlation check

data correlation with Spearman method

data correlation check
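The Spearman correlation check is a one-liner in pandas; sketched here on a hypothetical numeric subset of the data.

```python
import pandas as pd

# Hypothetical numeric subset; the real frame holds all numeric predictors.
df = pd.DataFrame({"price":      [10, 20, 30, 40, 50],
                   "horsepower": [100, 150, 200, 260, 330],
                   "mileage":    [90, 70, 50, 30, 10]})

# Rank-based correlation, more robust than Pearson for skewed prices.
corr = df.corr(method="spearman")
print(corr["price"].sort_values(ascending=False))
```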

Result: we can see that the most significant attributes are horsepower, mileage, highway_fuel_economy, year and engine_displacement.

Creating the Heatmap from the Post
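A correlation heatmap of this kind can be sketched with seaborn as below; the data here is random and the column names are stand-ins, since the notebook's actual frame is not reproduced.

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric subset standing in for the final dataframe.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(100, 3)),
                  columns=["price", "horsepower", "mileage"])

# Annotated heatmap of the Spearman correlation matrix.
ax = sns.heatmap(df.corr(method="spearman"), annot=True, cmap="coolwarm",
                 vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```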

Visualize Attributes

Visualization with Raincloud Plot

This shows that new cars are more expensive than used cars, and that cars that have had an accident are less expensive than cars without one.

Visualization with Pairplot using Seaborn

From the above result, we rerun it without city_fuel_economy and highway_fuel_economy.

Summary: the most significant attributes for price are horsepower, mileage, year, highway_fuel_economy and engine_displacement; possible additional attributes are owner_count and seller_rating.

Visualize the horsepower and price with Scatter plot using Seaborn lmplot.

It shows a positive relationship between horsepower and price.
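These scatter-plus-fit views use seaborn's `lmplot`; a self-contained sketch on synthetic data mimicking the positive horsepower-price trend (the values are made up):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical sample mimicking a positive horsepower-price relationship.
rng = np.random.default_rng(1)
hp = rng.uniform(100, 400, size=200)
df = pd.DataFrame({"horsepower": hp,
                   "price": 80 * hp + rng.normal(scale=2000, size=200)})

# Scatter plot with a fitted regression line.
g = sns.lmplot(data=df, x="horsepower", y="price")
g.savefig("hp_vs_price.png")
```

Swapping `x` for `engine_displacement`, `highway_fuel_economy` or `mileage` reproduces the other plots in this section.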

Visualize the engine_displacement and price with Scatter plot using Seaborn lmplot.

This shows a positive relationship as well, between price and engine_displacement.

Visualize the highway_fuel_economy and price with Scatter plot using Seaborn lmplot.

This shows a negative relationship between price and highway_fuel_economy.

The lower the mileage, the higher the price, and that goes with the general notion. However, the relation breaks when the mileage goes beyond 50k as the price does not vary much from that point.

Visualize the relationship between mileage and price with Scatter plot using Seaborn lmplot.

This shows a negative relationship as well, between price and mileage.

Explore Joint Attributes

When it comes to joint attributes, we would like to see if there's correlation between "car body type" and "city_fuel_economy".

First, we asked:

What's the most popular city fuel economy for used SUV, pickup Truck, sedan and convertible? (Please see Diagram 1)

Another question we asked was:

What's the highest amount of seating people are looking for when looking to buy a used car? (Please see Diagram 2)

Explore Attributes and Class

We are using price (for regression), has_accidents (for logistic regression) and body_type (for multi-class classification) as the response variables. In this section, we will try to find some interesting relations involving these features.

'has_accidents' and 'is_new' are the two boolean variables in our final dataset. We could have used either of these two for the binary classification, but the has_accidents feature is more balanced (false: 87%, true: 13%) than is_new. That is another driver, besides what is stated in the Business Understanding section, for why we leaned toward has_accidents as the binary response variable. Had we used is_new as the response variable instead, we would have needed to downsample or upsample the training data (downsampling is preferred), which would be an additional step.

One might think the drivers of Sedans, Minivans or Wagons are more cautious than those who drive SUVs or Pickup Trucks. But at least based on the used-car market as presented in this visual, there is not much difference in driving behavior among drivers of different body types. However, the drivers of Coupes and Convertibles do exhibit restraint on the road, perhaps because damage to their vehicles would cost them dearly.

Coupe, SUV/Crossover, Convertible and Sedan have many outliers, whereas Wagon, Minivan and Van do not have any. The median used-car prices of the 9 body types are within a close band.

Sedan, SUV/Crossover and Pickup Truck stay on the market much longer than Coupe, Wagon and Hatchback do. It is surprising to see that the Coupe, though an expensive category of car, sells faster than the Sedan, which we believe is the common person's car. Most likely supply and demand, or mispricing, is causing this anomaly.

We might think that used cars that have had accidents would stay longer on the market, but that is not the case, with the exception of Van.

New Features

We further group the numerical predictors, including "price", to help us gain more insight.

This new feature is created from the attribute 'price'. We divide price into different price groups as follows:
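Binning a numeric column into groups like this is typically done with `pd.cut`. The bin edges and labels below are hypothetical placeholders; the report's actual cut points are defined in the notebook.

```python
import pandas as pd

# Hypothetical prices and hypothetical bin edges/labels for illustration.
price = pd.Series([3000, 12000, 21700, 45000, 90000])
bins = [0, 10000, 20000, 40000, float("inf")]
labels = ["budget", "low", "mid", "high"]

# Each price falls into the half-open interval (left, right].
price_group = pd.cut(price, bins=bins, labels=labels)
print(price_group.tolist())
```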

Observations after using the new price group for data analysis:

Exceptional Work

In this section, we will use one-hot encoding on the data set below.

First, we encode the categorical features as numbers (see the function below).

Below is a snapshot of what the final data looks like after the categorical data has been encoded. You can see that body type is now a numerical representation, instead of the string (object) type it was before encoding.
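Both flavors of encoding mentioned in this section can be sketched in pandas. The snapshot described above (one integer code per body type) corresponds to `pd.factorize`, while true one-hot encoding corresponds to `pd.get_dummies`; the sample body types below are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the body_type column.
df = pd.DataFrame({"body_type": ["Sedan", "SUV / Crossover", "Sedan", "Coupe"]})

# Integer codes: one number per distinct body type.
df["body_type_code"], categories = pd.factorize(df["body_type"])

# One-hot columns: one 0/1 indicator per distinct body type.
one_hot = pd.get_dummies(df["body_type"], prefix="body")
print(df["body_type_code"].tolist(), list(one_hot.columns))
```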

As indicated in the heatmap in the Simple Statistics section above, in the future we want to further explore the strong correlation between "price" and "body_type".